Latent Topic Models for Hypertext
Authors
Abstract
Latent topic models have been successfully applied as an unsupervised topic discovery technique in large document collections. With the proliferation of hypertext document collections such as the Internet, there has also been great interest in extending these approaches to hypertext [6, 9]. These approaches typically model links in the same way they model words: the document-link co-occurrence matrix is treated just like the document-word co-occurrence matrix in standard topic models. In this paper we present a probabilistic generative model for hypertext document collections that explicitly models the generation of links. Specifically, a link from a word w to a document d depends directly on how frequent the topic of w is in d, as well as on the in-degree of d. We show how to perform EM learning on this model efficiently. By not modeling links as analogous to words, we use far fewer free parameters and obtain better link prediction results.
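The link dependence described in the abstract can be illustrated with a minimal sketch. The abstract does not give the functional form, so the multiplicative score below (smoothed in-degree times topic proportion) is an assumption made for illustration only. The names `link_distribution`, `theta`, `in_degree`, and `smoothing` are hypothetical and do not come from the paper.

```python
def link_distribution(theta, in_degree, topic, smoothing=1.0):
    """Distribution over target documents for a link from one word.

    theta     -- list of per-document topic proportions; theta[d][k] is the
                 share of topic k in document d
    in_degree -- list of incoming-link counts, one per document
    topic     -- index of the topic assigned to the source word
    smoothing -- added to in-degree so documents with no links keep
                 nonzero mass (an illustrative choice, not from the paper)
    """
    # ASSUMPTION: score = (in-degree + smoothing) * topic frequency in target.
    scores = [(deg + smoothing) * row[topic]
              for row, deg in zip(theta, in_degree)]
    total = sum(scores)
    if total == 0:
        raise ValueError("no document has mass on the requested topic")
    return [s / total for s in scores]


# Toy collection: three documents, two topics.
theta = [[0.9, 0.1],
         [0.2, 0.8],
         [0.5, 0.5]]
in_degree = [0, 3, 1]

# A word assigned to topic 1 most likely links to document 1, which is
# both heavy on topic 1 and frequently linked to.
probs = link_distribution(theta, in_degree, topic=1)
best = max(range(len(probs)), key=probs.__getitem__)
```

In this sketch the only link-specific quantities are one number per document, rather than a separate distribution over target documents for every topic. This is one way to read the abstract's claim about using fewer free parameters.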
Similar resources
HTM: A Topic Model for Hypertexts
Previously, topic models such as PLSI (Probabilistic Latent Semantic Indexing) and LDA (Latent Dirichlet Allocation) were developed for modeling the contents of plain texts. Recently, topic models for processing hypertexts such as web pages were also proposed. The proposed hypertext models are generative models giving rise to both words and hyperlinks. This paper points out that to better repres...
Topic Models for Hypertext: How Many Words is a Single Link Worth?
Latent topic models have been successfully applied as an unsupervised learning technique on various types of data such as text documents, images and biological data. In recent years, with the rapid growth of the Internet, these models have also been adapted to hypertext data. Explicitly modeling the generation of both words and links has been shown to improve inferred topics and open a new rang...
From Latent Semantics to Spatial Hypertext
In this paper, we describe an integrated approach to the development of virtual reality-enabled spatial hypertext. This approach integrates several fundamentally related tasks into a cohesive and automated process, including latent semantic indexing, transformation between semantic and spatial models, and virtual reality modelling. The design of the visual user interface draws upon the theory o...
A Joint Semantic Vector Representation Model for Text Clustering and Classification
Text clustering and classification are two main tasks of text mining. Feature selection plays the key role in the quality of the clustering and classification results. Although word-based features such as term frequency-inverse document frequency (TF-IDF) vectors have been widely used in different applications, their shortcoming in capturing semantic concepts of text motivated researchers to use...
Evaluating and Extending Latent Methods for Link-Based Classification
Data describing networks such as social networks, citation graphs, hypertext systems, and communication networks is becoming increasingly common and important for analysis. Research on link-based classification studies methods to leverage connections in such networks to improve accuracy. Recently, a number of such methods have been proposed that first construct a set of latent features or links...